SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered)

REPORT DOCUMENTATION PAGE
(READ INSTRUCTIONS BEFORE COMPLETING FORM)

1. REPORT NUMBER
   18

2. GOVT ACCESSION NO.

3. RECIPIENT'S CATALOG NUMBER

4. TITLE
   UVFA, A Combined Linear and Nonlinear Factor Analysis Program Package
   for Chemical Data Evaluation

5. TYPE OF REPORT & PERIOD COVERED
   Technical Report - Interim, 8/80 - 11/80

6. PERFORMING ORG. REPORT NUMBER

7. AUTHOR(s)
   Clemens Jochum
   Bruce R. Kowalski

8. CONTRACT OR GRANT NUMBER(s)
   N00014-75-C-0536

9. PERFORMING ORGANIZATION NAME AND ADDRESS
   Laboratory for Chemometrics, Department of Chemistry,
   University of Washington, Seattle, Washington 98195

10. PROGRAM ELEMENT, PROJECT, TASK AREA & WORK UNIT NUMBERS
    NR 051-565

11. CONTROLLING OFFICE NAME AND ADDRESS
    Materials Sciences Division
    Office of Naval Research

14. MONITORING AGENCY NAME & ADDRESS (if different from Controlling Office)

15. SECURITY CLASS. (of this report)
    UNCLASSIFIED

15a. DECLASSIFICATION/DOWNGRADING SCHEDULE

16. DISTRIBUTION STATEMENT (of this Report)
    Approved for public release; distribution unlimited

17. DISTRIBUTION STATEMENT (of the abstract entered in Block 20, if different from Report)

18. SUPPLEMENTARY NOTES
    Prepared for publication in Analytica Chimica Acta

19. KEY WORDS (Continue on reverse side if necessary and identify by block number)
    underlying variable factor analysis
    factor loading matrix; factor weight matrix

NOV 28 1980

20. ABSTRACT (Continue on reverse side if necessary and identify by block number)

UVFA, an underlying variable factor analysis program, is described. The theories
of principal component analysis and nonlinear least squares projection
techniques are outlined and compared. Several applications from various
chemical fields are presented which show that a complete analysis of the
underlying structure and dimensionality of a chemical data set should always
include these nonlinear projection techniques.


OFFICE  OF  NAVAL  RESEARCH 


Contract  N00014-75-C-0536 
Task No. NR 051-565
TECHNICAL  REPORT  NO.  18 

UVFA, A Combined Linear and Nonlinear Factor Analysis
Program  Package  for  Chemical  Data  Evaluation 

by 

Clemens  Jochum  and  Bruce  R.  Kowalski 
Prepared  for  Publication 
in 

Analytica  Chimica  Acta 


Laboratory  for  Chemometrics 
Department  of  Chemistry  BG-10 
University  of  Washington 
Seattle,  Washington  98195 


November  1980 


Reproduction  in  whole  or  in  part  is  permitted  for 
any  purpose  of  the  United  States  Government 


This  document  has  been  approved  for  public  release 
and  sale;  its  distribution  is  unlimited 




SUMMARY 

UVFA, an underlying variable factor analysis program, is described. The
theories of principal component analysis and nonlinear least squares projection
techniques are outlined and compared. Several applications from various
chemical fields are presented which show that a complete analysis of the
underlying structure and dimensionality of a chemical data set should always
include these nonlinear projection techniques.


INTRODUCTION

Multivariate statistics, originally developed for applications in the social
sciences, have been increasingly applied to chemical data evaluation. In fact,
the statistical treatment of chemical data has become a whole new branch of
analytical chemistry, called chemometrics [1].

One of the most powerful methods in chemometrics, which has been applied as
a "stand-alone" method as well as in combination with other methods, is
principal component factor analysis (PCA) [2]. Applications range from data
reduction problems and interpretation of the underlying structure of a data set
to a preliminary treatment of the data bases for a path modelling analysis [3].
PCA has been applied to, e.g., mass spectral and environmental data, NMR and
chromatography data [4].

PCA assumes a linear relation among the variables. In nature, however,
most relations between physical parameters or variables are nonlinear. To
overcome this limitation of linear factor analysis, algorithms such as nonlinear
least squares, multidimensional scaling [5] and parametric mapping [6] have been
developed for the analysis of the underlying nonlinear structure of a data base.
So far, no applications of these nonlinear methods to chemical data analysis
have been published. In the next section the theory of the different
linear and nonlinear methods is explained.

In  the  following  an  interactive  program  package  is  described  which  includes 
not  only  principal  component  factor  analysis  and  rotational  methods,  but  also 
nonlinear  least  squares  projection  techniques  such  as  multidimensional  scaling, 
nonlinear  and  parametric  mapping  and  graphical  output  routines.  The  algorithms 
and  the  program  are  demonstrated  on  two  chemical  data  sets. 


THEORY 


The underlying relation among n measurements (e.g. melting point, dipole
moment, etc.) of a data matrix $Z = (z_{ij})$, $i = 1,\dots,m$, $j = 1,\dots,n$,
consisting of m samples is to be analyzed.

To  give  the  measurements  equal  weight,  they  are  usually  scaled  to  unit 
variance  and  zero  mean. 
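
The scaling step can be illustrated with a minimal sketch. The function name `autoscale`, the sample data and the choice of the sample variance (divisor m - 1) are assumptions made for illustration only; the text does not state which divisor is used.

```python
import math

def autoscale(Z):
    """Scale every column of Z (rows = samples) to zero mean and unit variance."""
    m = len(Z)          # number of samples
    n = len(Z[0])       # number of measurements
    scaled = [[0.0] * n for _ in range(m)]
    for j in range(n):
        column = [row[j] for row in Z]
        mean = sum(column) / m
        # Assumption: sample variance with divisor m - 1.
        variance = sum((x - mean) ** 2 for x in column) / (m - 1)
        std = math.sqrt(variance)
        for i in range(m):
            scaled[i][j] = (column[i] - mean) / std
    return scaled

# Hypothetical data: 4 samples, 2 measurements on very different scales.
Z = [[1.0, 200.0], [2.0, 180.0], [3.0, 260.0], [4.0, 240.0]]
Zs = autoscale(Z)
```

After scaling, both columns contribute equally to distances and variances regardless of their original units. A constant column has zero variance and would have to be removed before this step.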

In  a  three  variable  data  set  (n  =  3)  the  measurement  vectors  can  be 
represented  graphically  in  a  three-dimensional  space  (Fig.  1). 

Figure  1 


Factor analysis determines the dimensionality of the hyperspace necessary
to represent the data. The first factor $X_1$ is represented by the longest axis
of the hyperspace containing the data, i.e. it represents the largest amount
of variance in one dimension. The second factor $X_2$ is represented by the
second longest axis orthogonal to the first one, and so on. To obtain the r
factors necessary to represent most of the total variance of the data set,
the data matrix Z is decomposed into a factor weight matrix (factor loading
matrix) $A = (a_{ij})$, $i = 1,\dots,n$, $j = 1,\dots,r$, and a factor score
matrix $P = (p_{ij})$, $i = 1,\dots,m$, $j = 1,\dots,r$ $(r < n)$:

$$Z = P A^{T}$$


The columns of A are determined by calculating the eigenvectors of the
data covariance matrix C:

$$C = Z^{T} Z$$

The entries $a_{ij}$ of the factor loading matrix A can be considered as the
multiple correlation coefficients of the variable i with the factor j.

The factor score matrix represents the data in terms of factor coordinates
and is calculated according to

$$P = Z A$$

This  transformation  is  known  as  the  Karhunen-Loeve  expansion  [7]. 
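
The eigenvector and score calculation can be sketched for the first factor as follows. This is an illustration under stated assumptions, not the UVFA implementation: power iteration is used here only because it needs no external library (the text does not say how UVFA computes eigenvectors), and all function names and the sample data are hypothetical.

```python
import math

def cross_product_matrix(Z):
    """Return C = Z^T Z for Z given as a list of m rows with n columns each."""
    n = len(Z[0])
    return [[sum(row[i] * row[j] for row in Z) for j in range(n)] for i in range(n)]

def mat_vec(C, v):
    """Multiply the square matrix C by the vector v."""
    return [sum(c * x for c, x in zip(row, v)) for row in C]

def leading_factor(C, iterations=200):
    """Power iteration: unit-length dominant eigenvector of C and its eigenvalue."""
    n = len(C)
    v = [1.0 / math.sqrt(n)] * n  # arbitrary start vector (assumption)
    for _ in range(iterations):
        w = mat_vec(C, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(x * y for x, y in zip(v, mat_vec(C, v)))
    return v, eigenvalue

# Hypothetical mean-centred data: 5 samples, 2 strongly correlated variables.
Z = [[-2.0, -1.9], [-1.0, -1.1], [0.0, 0.1], [1.0, 0.9], [2.0, 2.0]]
C = cross_product_matrix(Z)
a1, lambda1 = leading_factor(C)
# First column of P = Z A: the score of every sample on the first factor.
scores = [sum(z * a for z, a in zip(row, a1)) for row in Z]
```

For these data the first factor carries nearly all of the variance, because the two variables lie almost on a straight line. Further factors could be obtained by deflating C and repeating the iteration.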

As  mentioned  above,  the  relation  between  physical  parameter  variables  is 
not  always  linear.  It  is,  for  example,  possible  that  the  data  lie  along  a 
curved  line  or  surface  (Fig.  2). 

Figure  2 

Linear principal component factor analysis would still come up with three
factors since the variance for all possible three dimensional orthogonal
coordinate systems is greater than zero in any coordinate direction. On the
other hand, there are obviously only two underlying nonlinear independent
variables.

To  solve  this  problem,  nonlinear  least  squares  projection  methods  have 
been  developed  [5,6]. 

The distances $d_{ij}$, $i,j = 1,\dots,m$, between all m data vectors of Z are
calculated. The data points are then arranged in an r-dimensional space
(r<n) in a way that the stress S

S  = 

is minimal [5]. The stress compares the original distances with the distances
recalculated between the data points in the lower r-dimensional space. The
stress S thus represents a measure of the goodness of fit of the data vectors
projected in the r-dimensional space compared with their configuration in the
original n dimensional space.
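
A stress calculation of this kind can be sketched as follows. The normalisation is an assumption: the sketch uses a Kruskal-type form, the square root of the summed squared differences between original and recalculated distances divided by the summed squared original distances, which may differ from the expression used in UVFA. The function names and the sample configurations are hypothetical.

```python
import math

def euclidean_distances(points):
    """Full matrix of Euclidean distances between all points."""
    m = len(points)
    return [[math.dist(points[i], points[j]) for j in range(m)] for i in range(m)]

def stress(original, projected):
    """Kruskal-type normalised stress between two distance matrices (assumed form)."""
    numerator = 0.0
    denominator = 0.0
    m = len(original)
    for i in range(m):
        for j in range(i + 1, m):
            numerator += (original[i][j] - projected[i][j]) ** 2
            denominator += original[i][j] ** 2
    return math.sqrt(numerator / denominator)

# Three collinear points in two dimensions fit perfectly in one dimension.
line_2d = euclidean_distances([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
line_1d = euclidean_distances([[0.0], [5.0], [10.0]])
stress_line = stress(line_2d, line_1d)

# A right-angled triangle cannot be reproduced by dropping one coordinate.
triangle_2d = euclidean_distances([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
triangle_1d = euclidean_distances([[0.0], [1.0], [0.0]])
stress_triangle = stress(triangle_2d, triangle_1d)
```

A projection method would vary the low dimensional coordinates until this value is as small as possible; the sketch only evaluates the stress of a given configuration.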

The different projection techniques differ mainly in their measure for the
goodness of fit. To demonstrate the different applications for PCA,
multidimensional scaling [5] and parametric mapping [6], their optimum
theoretical results on three different two dimensional data sets (I, II, III)
are shown (Fig. 3).


Figure 3

The parametric mapping algorithm is able to determine a ring shaped one
dimensional structure of the data since it considers only local environments of
data points. However, since this method does not look at the global fit of all
data points, it sometimes ends up with too small a dimensionality.

Although  there  exist  programs  for  these  nonlinear  least  squares  methods, 
they  are  not  set  up  for  chemical  data  bases.  They  are  not  input  compatible  with 
each  other  and  they  work  only  as  batch  programs.  Since  these  programs  only 
include  either  multidimensional  scaling  or  parametric  mapping  and  no  linear 
factor  analysis  program,  there  was  a  definite  need  for  a  combined  package. 

Such a combined underlying variable factor analysis program is described in
the next section.


THE PROGRAM UVFA

The underlying variable factor analysis program UVFA [9] consists of a
driver routine, a set of utility routines and 21 major subroutines which perform
the actual data analysis. Since the data are stored on disk files, only the
driver routine and the utility routines have to stay in core during the whole
run. The 21 major subroutines can be loaded one at a time. Thus the program
usually needs less than $60_8$ K words of core to run although it consists of
more than 10,000 statements.

The  input,  output  and  internal  binary  files  are  fully  compatible  with  our 
pattern  recognition  program  ARTHUR  [8]. 

UVFA  can  be  run  interactively  or  in  batch  mode  and  has  graphical  output 
routines  for  Tektronix  4010/4014  terminals,  Calcomp  plotter  or  line  printer. 
Figure  4  shows  the  general  setup  of  the  program. 

Figure  4 

PRICO does a principal component analysis with or without communality
iteration [2]. MULSCA and PARAMA are the nonlinear least squares projection
routines for multidimensional scaling and parametric mapping. The underlying
linear and nonlinear factors can be plotted with the routines PRIPL0 (line
printer plot), CALPL0 (Calcomp plot) or TEKPL0 (Tektronix graphics terminal
plot). For additional error analysis the linear factors can be back-transformed
by calling KATRAN (Karhunen-Loeve transformation) and BACKTR. The program also
assists with the interpretation of the factors through ANALYS (ordering of
the factors and loadings and performing various tests for finding the intrinsic
dimensionality), HIER (hierarchical cluster analysis), and ROTOR and R0TSUB
(various kinds of rotations).

There exist versions for 60 bit CDC computers and 32 bit DEC VAX computers.
The VAX version should be largely compatible with other DEC and IBM computers.
The whole program is written in FORTRAN.


APPLICATIONS 


Among  the  various  applications,  three  are  discussed  in  more  detail:  A  mass 
spectral  data  set,  a  constitutional  similarity  set  of  chemical  compounds  and  a 
data  set  of  physical  parameters  of  biologically  interesting  compounds. 

The first data set consists of the mass spectra of 11 mono- and
sesquiterpenes [10]. These are Isoprene (1), Myrcene (2), p-Cymene (3),
β-Pinene (4), Camphene (5), Limonene (6), α-Cedrene (7), Caryophyllene (8),
β-Selinene (9), Santene (10), δ-Cadinene (11). Figure 5 shows the plot of the
loadings of the first two vectors of the factor weight matrix.

Figure  5 

These two factors account for 97% of the total variance. We see two clusters
of compounds; only compound 8 seems to lie somewhat in between. It turns out
that one cluster consists of the monoterpenes and Isoprene; the second consists
of the sesquiterpenes. Compound 8 (Caryophyllene) should therefore belong to the
second cluster (see below). Since the first factor already accounts for 94% of
the total variance there is clearly one main factor, i.e. there is one main
underlying fragmentation pattern.
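
The partial variances quoted here follow from the eigenvalues of the covariance matrix: each factor's share is its eigenvalue divided by the sum of all eigenvalues. The sketch below uses hypothetical eigenvalues chosen only so that the first two shares come out as 94% and 3%; they are not taken from the terpene data, and the cut-off used to count factors is likewise an assumption.

```python
def partial_variances(eigenvalues):
    """Fraction of the total variance carried by each factor."""
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]

def factors_needed(eigenvalues, threshold):
    """Smallest number of leading factors whose cumulative share reaches threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    for count, fraction in enumerate(partial_variances(ordered), start=1):
        cumulative += fraction
        if cumulative >= threshold:
            return count
    return len(ordered)

# Hypothetical eigenvalues, for illustration only.
eigenvalues = [9.4, 0.3, 0.2, 0.1]
fractions = partial_variances(eigenvalues)
two_factor_cutoff = factors_needed(eigenvalues, 0.95)
one_factor_cutoff = factors_needed(eigenvalues, 0.90)
```

With a 95% cut-off two factors are retained, with a 90% cut-off only one. The choice of cut-off therefore matters when a single factor dominates, as it does here.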

The  nonlinear  multidimensional  scaling  configuration  of  our  data  in  two 
dimensions  shows  the  separation  of  the  two  clusters  very  clearly  (Fig.  6). 

Figure  6 

The  very  similar  fragmentation  pattern  of  Isoprene  and  the  monoterpenes  is 
reflected  by  their  close  neighborhood  within  the  cluster.  The  one  dimensional 
multidimensional  scaling  of  the  mass  spectra  (Fig.  7)  corroborates  that  there 
is  mainly  one  underlying  fragmentation  pattern:  The  stress  of  the  one  dimensional 
projection  is  almost  as  low  as  for  two  dimensions  [11]  (0.0031  and  0.008 
respectively)  and  thus  the  intrinsic  dimensionality  is  most  likely  one. 


Figure  7 


In our second example, the data base consists of a distance matrix
$D = (d_{ij})$, $i,j = 1,\dots,13$, of another set of 13 terpene compounds.
These are Isoprene, four monoterpenes (Myrcene, Menthol, Camphene,
Umbellulone), four sesquiterpenes (Bisabolene, α-Cadinol, Eudesmol,
Partheniol), three diterpenes (Dextropimaric Acid, Phyllocladene, Roylleanone)
and one triterpene (β-Amyrin) [12]. The distance measure $d_{ij}$ is the
minimum chemical distance [13] between the compounds i and j. It indicates the
constitutional similarity between two compounds. To perform a principal
component analysis, a covariance matrix C is generated from the distance
matrix [14]:

$$d_{ij} = 2\,(1 - C_{ij})^{1/2}$$
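
Solving this relation for the covariance gives $C_{ij} = 1 - (d_{ij}/2)^2$. The sketch below applies this inversion, taking the relation exactly as printed above; the function names and the small distance matrix are hypothetical, and any prior scaling of the minimum chemical distances is not covered.

```python
import math

def covariance_from_distance(D):
    """Invert d = 2 * (1 - c) ** 0.5 element by element: c = 1 - (d / 2) ** 2."""
    return [[1.0 - (d / 2.0) ** 2 for d in row] for row in D]

def distance_from_covariance(C):
    """Forward relation as printed: d = 2 * (1 - c) ** 0.5 (requires c <= 1)."""
    return [[2.0 * math.sqrt(1.0 - c) for c in row] for row in C]

# Hypothetical symmetric distance matrix for three compounds.
D = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 1.5],
     [2.0, 1.5, 0.0]]
C = covariance_from_distance(D)
D_back = distance_from_covariance(C)
```

Identical compounds (distance 0) obtain a covariance of 1, and the covariance stays at or above -1 only for distances up to $2\sqrt{2}$, so larger distances would have to be rescaled first.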

We again get two linear factors (85.8% and 13.1% partial variances) and a
plot of their loadings (Fig. 8) shows that there are no particular clusters of
compounds.

Figure  8 

For  further  interpretation  we  look  at  the  two  dimensional  nonlinear 
projection  (Fig.  9). 

Figure  9 

The compounds are now clustered according to whether they are mono-, sesqui-,
di- or triterpenes. Again the stress for a one dimensional projection is almost
as low as for two dimensions (0.0087 and 0.00859, respectively), which suggests
that all these compounds are built from one major structural element, the
isoprene unit. This is indicated by the ordering of the compounds along the one
dimensional projection axis (not shown) according to their number of isoprene
units.

The third example corroborates some results of a linear factor analysis
(principal component analysis) performed by R. D. Cramer on a data set of 10
physical parameters of 44 organic compounds [15]. Cramer obtains two linear
factors with 75.5% and 21% partial variance. A nonlinear projection of the
compounds in a two dimensional space shows no distinct clusters of the compounds
(Fig. 10).

Figure  10 

Since the data points do not lie along a line but are spread almost
equally in both directions, the intrinsic dimensionality seems to be two. An
attempt to project the data into a one dimensional space results in a very
interesting pattern in a plot of the calculated distances versus the original
distances $d_{ij}$ (see above) of the configuration (Fig. 11).

Figure 11

Although  most  of  the  distance  points  lie  close  to  the  diagonal  which  indicates 
that  most  of  the  compounds  can  be  fairly  well  fit  in  one  dimension,  some  of 
them  lie  almost  along  an  axis  perpendicular  to  the  diagonal.  This  most  likely 
indicates  that  there  is  one  major  and  one  minor  nonlinear  factor  in  the  two 
dimensional  space.  This  is,  however,  subject  to  further  investigation. 


CONCLUSION 


The  examples  show  that  the  combination  of  principal  component  analysis  with 
nonlinear  least  squares  projection  techniques  is  a  very  powerful  tool  for  the 
determination  of  the  intrinsic  dimensionality  and  the  interpretation  of  the 
factors. Our underlying variable factor analysis program UVFA [9] provides
a convenient way for a complete factor and nonlinear projection analysis of


FIGURE CAPTIONS


Figure 1: Three dimensional representation of measurement vectors.

Figure 2: Three dimensional data points lying on a nonlinear surface.

Figure 3: Comparison of Principal Component, Multidimensional Scaling and
Parametric Mapping Analysis.

Figure 4: General Schematic of the UVFA program.

Figure 5: Plot of the loadings of the first two factors of the terpene
mass spectral data.

Figure 6: Two dimensional nonlinear projection of terpene mass spectral data.

Figure 7: One dimensional projection of terpene mass spectral data.

Figure 8: Factor loadings of minimum chemical distance data of 13 terpenes.

Figure 9: Two dimensional projection of the minimum chemical distance terpene
data.

Figure 10: Two dimensional nonlinear projection of 10 physical parameters/
44 compound data.

Figure 11: Original versus calculated distances of the one dimensional optimum
configuration of the 10 physical parameters/44 compound data.